

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Neural Information Processing Systems

With the capability of modeling bidirectional contexts, denoising-autoencoding-based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, by relying on corrupting the input with masks, BERT neglects dependencies between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Empirically, under comparable experimental settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
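The core objective the abstract describes — maximizing the expected log-likelihood over permutations of the factorization order — can be illustrated with a minimal sketch. This is not the XLNet implementation (which uses a Transformer-XL backbone with two-stream attention); the conditional model here is a hypothetical stand-in, and all function names are illustrative.

```python
import math
import random

def permutation_ll(tokens, cond_logprob, perm):
    """Log-likelihood of `tokens` under one factorization order `perm`:
    sum over steps t of log p(x_{z_t} | x_{z_<t})."""
    total = 0.0
    for t, pos in enumerate(perm):
        # Only tokens earlier in the permutation order are visible as context,
        # so any position can condition on any other -- bidirectional context
        # without input corruption.
        context = {p: tokens[p] for p in perm[:t]}
        total += cond_logprob(tokens[pos], pos, context)
    return total

def expected_ll(tokens, cond_logprob, n_samples=100, seed=0):
    """Monte-Carlo estimate of E_z[log p(x; z)] over uniformly sampled
    permutations z (in practice one permutation is sampled per sequence)."""
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    vals = []
    for _ in range(n_samples):
        perm = positions[:]
        rng.shuffle(perm)
        vals.append(permutation_ll(tokens, cond_logprob, perm))
    return sum(vals) / len(vals)

# Toy conditional model: uniform over a small vocabulary, ignoring context.
# A real model would be a neural network conditioned on `context` and `pos`.
VOCAB = ["new", "york", "is", "a", "city"]
def uniform_cond(token, pos, context):
    return -math.log(len(VOCAB))
```

With the uniform toy conditional, every factorization order gives the same likelihood; the point of the sketch is only the structure of the objective, where each sampled order lets every position see a different subset of the sequence as context.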


Reviews: XLNet: Generalized Autoregressive Pretraining for Language Understanding

Neural Information Processing Systems

Originality: The architecture is novel compared to recent lines of language-model work, which all used variations of BERT or GPT (SciBERT, MT-DNN, MASS, etc.). The example (the "New York is a city" one) makes sense, but since the permutation is sampled randomly when computing the objective, I still couldn't see why it works better than sequential order, given that humans speak/write in sequential order. Could you add more intuition to the paper? Or have you tried predicting n-grams, compared to permutation? Quality: Very high, considering they did extensive studies on multiple benchmarks; the ablation study is nicely done as well.


Reviews: XLNet: Generalized Autoregressive Pretraining for Language Understanding

Neural Information Processing Systems

The paper proposes XLNet, a generalized autoregressive pretraining method for language representation learning. The paper shows that XLNet outperforms the state-of-the-art method BERT on 12 tasks. The paper is of high quality in terms of clarity, technical soundness, significance, and novelty. The authors successfully addressed the issues pointed out by the reviewers, and the reviewers are very satisfied with the response.



XLNet: Generalized Autoregressive Pretraining for Language Understanding

Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Russ R.; Le, Quoc V.

Neural Information Processing Systems
